Systematic review of accelerometer-based methods for 24-h physical behavior assessment in young children (0–5 years old)

Background Accurate accelerometer-based methods are required for assessment of 24-h physical behavior in young children. We aimed to summarize evidence on measurement properties of accelerometer-based methods for assessing 24-h physical behavior in young children.
Methods We searched PubMed (MEDLINE) up to June 2021 for studies evaluating reliability or validity of accelerometer-based methods for assessing physical activity (PA), sedentary behavior (SB), or sleep in 0–5-year-olds. Studies using a subjective comparison measure or an accelerometer-based device that did not directly output time series data were excluded. We developed a Checklist for Assessing the Methodological Quality of studies using Accelerometer-based Methods (CAMQAM) inspired by COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN).
Results Sixty-two studies were included, examining conventional cut-point-based methods or multi-parameter methods. For infants (0–12 months), several multi-parameter methods proved valid for classifying SB and PA. From three months of age, methods were valid for identifying sleep. In toddlers (1–3 years), cut-points appeared valid for distinguishing SB and light PA (LPA) from moderate-to-vigorous PA (MVPA). One multi-parameter method distinguished toddler-specific SB. For sleep, no studies were found in toddlers. In preschoolers (3–5 years), valid hip and wrist cut-points for assessing SB, LPA, and MVPA, and wrist cut-points for sleep, were identified. Several multi-parameter methods proved valid for identifying SB, LPA, MVPA, and sleep. Despite promising results of multi-parameter methods, few models were open-source. While most studies used a single device or axis to measure physical behavior, more promising results were found when combining data derived from different sensor placements or multiple axes.
Conclusions Up to age three, valid cut-points to assess 24-h physical behavior were lacking, while multi-parameter methods proved valid for distinguishing some waking behaviors. For preschoolers, valid cut-points and algorithms were identified for all physical behaviors. Overall, we recommend more high-quality studies evaluating 24-h accelerometer data from multiple sensor placements and axes for physical behavior assessment. Standardized protocols focusing on including well-defined physical behaviors in different settings representative of children’s developmental stage are required. Using our CAMQAM checklist may further improve methodological study quality.
PROSPERO Registration number CRD42020184751.
Supplementary Information The online version contains supplementary material available at 10.1186/s12966-022-01296-y.

This supplement presents the newly developed Checklist for Assessing the Methodological Quality of studies using Accelerometer-based Methods (CAMQAM) and provides the scoring manual. We distinguish two measurement properties: reliability and validity. Each measurement property includes different types, with specific quality aspects. Table 1 summarizes the definitions of the measurement properties of studies evaluating accelerometer-based methods and preferred methods for evaluation.

Table 1.

Measurement property | Definition | Preferred method
1. Reliability | The degree to which a measurement instrument is free from measurement error [1] | Study design: at least two measurements; independent measurements; similar measurement conditions; appropriate time interval
a) Test-retest reliability | The degree to which scores collected from the same participant are the same for repeated measurements | Statistical method: intraclass correlation coefficient (ICC) in accordance with model, type, and definition [2], or κ(w)
b) Inter-device reliability | The degree to which scores vary that are simultaneously collected using multiple devices at the same placement site | Statistical method: ICC in accordance with model, type, and definition [2], or correlation (rp, rsp, or r)
2. Validity | The degree to which an instrument truly measures the construct it purports to measure [3] |
a) Criterion validity | The degree to which the scores of an instrument are an adequate reflection of a gold standard |
… behavior, e.g., sleep algorithms, machine learning methods) | The extent to which scores of an algorithm or a classifier predict scores on a comparator instrument(s) [4] | Statistical method: accuracy reported for all classes and confidence interval (or related measure such as precision or recall)

Abbreviations: ICC, intraclass correlation coefficient; κ, Kappa; κw, weighted Kappa; r, correlation coefficient (type unknown); rp, correlation coefficient (Pearson); rsp, correlation coefficient (Spearman's rank); AUC-ROC, area under the receiver operating characteristic curve; LoA, limits of agreement.
Table adapted from Terwee et al. (2010) [3].

The following sections present the checklist boxes to assess whether a study meets the standards for good methodological study quality and a guide for the appraisal of these measurement properties.

RELIABILITY

Test-retest reliability
This type of reliability of accelerometer-based devices refers to the consistency in the outcome (e.g., physical activity, sedentary behavior, sleep) of accelerometer recordings. An important assumption in test-retest reliability is that the participants wearing the accelerometer-based device are stable on the construct to be measured during the interim period between the repeated measurements (i.e., a maximum of two weeks) (rated using items 1-3). Depending on the outcome scores, the preferred statistics are the intraclass correlation coefficient (ICC) (for continuous scores), in accordance with model, type, and definition [2], or (weighted) Kappa (κ(w)) (rated using items 4-8). The checklist for assessing test-retest reliability is presented in Box 1a and explained below.

Item 1. Were participants stable in the interim period on the construct to be measured?
Evidence that participants were stable could be, for example, an assessment of a global rating of change (e.g., number of hours of sleep), completed by the participants or their caregivers. When an intervention was given in the interim period, one can assume that (many of) the participants have changed their physical behavior. In that case, it is recommended to rate this item as "Inadequate".

Item 2. Was the time interval appropriate?
The time interval between the accelerometer recordings must be appropriate. It should be short enough to ensure that participants have not substantially changed their physical behavior patterns (e.g., starting to walk, walking with a walker). A time interval of at most 2 weeks was considered appropriate and rated as "Very good".

Item 3. Were the test conditions similar for the measurements?
The test conditions should be similar. This refers to the type of administration (e.g., the same axis of the accelerometer was used, the device was placed on the same location), the setting in which the accelerometer-based device was administered (e.g., home, preschool), and the instructions given. The reliability may be underestimated if these test conditions were not similar.

Item 4. For continuous scores: Was an intraclass correlation coefficient (ICC) calculated?
For continuous scores the ICC is preferred because this statistic incorporates systematic error. The use of correlation coefficients (Pearson (rp) or Spearman's rank (rsp)) is rated as "Doubtful" when it is unclear whether there were systematic differences, as these coefficients do not incorporate systematic error. This item was rated as "Very good" if the ICC was calculated, the model/formula of the ICC was described, and this analysis decision was appropriate according to Koo and Li (2016), e.g., two-way mixed effects model, single measurement or mean of k measurements, absolute agreement, or consistency [2].

Items 5, 6, 7. Was κ(w) calculated? Was the weighting scheme described (e.g., linear, quadratic)?
For dichotomous or ordinal scores, Cohen's κ is the preferred statistical method. For ordinal scores, partial chance agreement should be considered, and κ should therefore be weighted (i.e., κw). In addition, a description of the weights should be included.

Item 8. Were there any other important flaws?
Examples of important flaws are including participants only if their data were complete, or more than 50% of the data being missing.
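
As a minimal sketch of the weighted Kappa described for ordinal scores above, the following Python function computes Cohen's κ with linear or quadratic disagreement weights. The function name, the integer coding of categories (0 to n_categories - 1), and the example ratings are illustrative assumptions and are not part of CAMQAM.

```python
def weighted_kappa(ratings_a, ratings_b, n_categories, weighting="linear"):
    """Cohen's (weighted) kappa for two sets of ordinal ratings.

    Ratings are assumed to be integers coded 0 .. n_categories - 1.
    weighting: "none" (unweighted), "linear", or "quadratic".
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    if n_categories < 2:
        raise ValueError("at least two categories are required")

    n = len(ratings_a)
    k = n_categories

    # Observed joint proportions for each pair of categories.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[a][b] += 1.0 / n

    # Marginal proportions of each measurement occasion (or device).
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i, j):
        distance = abs(i - j) / (k - 1)
        if weighting == "linear":
            return distance
        if weighting == "quadratic":
            return distance ** 2
        if weighting == "none":
            return 0.0 if i == j else 1.0
        raise ValueError("weighting must be 'none', 'linear', or 'quadratic'")

    obs = sum(disagreement(i, j) * observed[i][j]
              for i in range(k) for j in range(k))
    exp = sum(disagreement(i, j) * row[i] * col[j]
              for i in range(k) for j in range(k))
    if exp == 0:
        raise ValueError("kappa is undefined when only one category is used")
    return 1.0 - obs / exp


# Hypothetical ordinal codes (0 = SB, 1 = LPA, 2 = MVPA) from test and retest.
test = [0, 0, 1, 1, 2, 2]
retest = [0, 1, 1, 1, 2, 0]
for scheme in ("none", "linear", "quadratic"):
    print(scheme, round(weighted_kappa(test, retest, 3, scheme), 3))
```

The same ratings yield different values under the different schemes, which is why the weighting scheme needs to be described alongside the coefficient.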

Inter-device reliability
This type of reliability refers to the consistency in accelerometer-derived (epoch level) data between different devices. The devices should be attached on the same side of the body, as dominance of side influences the accelerometer recordings. An important assumption in inter-device reliability is that device settings such as epoch length and sampling frequency were similar between the devices (rated using item 1). Depending on the type of score, the preferred statistics are the ICC (for continuous scores), in accordance with model, type, and definition [2], or κ(w) (rated using items 2-6).

Item 2. For continuous scores: Was an intraclass correlation coefficient (ICC) calculated?
For continuous scores the ICC is preferred because this statistic incorporates systematic error. The use of correlation coefficients (rp or rsp) is rated as "Doubtful" when it is unclear whether there were systematic differences, as these coefficients do not incorporate systematic error. This item was rated as "Very good" if the ICC was calculated, the model/formula of the ICC was described, and this analysis decision was appropriate according to Koo and Li (2016), e.g., one-way/two-way random effects model or two-way mixed effects, single rater or mean of k raters, absolute agreement [2].

Items 3, 4, 5. Was κ(w) calculated? Was the weighting scheme described (e.g., linear, quadratic)?
For dichotomous or ordinal scores, Cohen's κ is the preferred statistical method. For ordinal scores, partial chance agreement should be considered, and κ should therefore be weighted (i.e., κw). In addition, a description of the weights should be included.

Item 6. Were there any other important flaws?
Examples of important flaws are including participants only if their data were complete, or more than 50% of the data being missing.
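
To illustrate why an absolute-agreement ICC incorporates systematic error whereas a consistency-type coefficient does not, the sketch below computes the two-way, single-measurement ICC under both definitions from the usual ANOVA mean squares. The function name and the example data (two hypothetical devices that differ by a constant offset) are assumptions for illustration only; which model, type, and definition is appropriate for a given study should follow Koo and Li (2016) [2].

```python
def icc_two_way_single(scores, definition="absolute"):
    """Two-way, single-measurement ICC.

    scores: one row per participant, one column per device or occasion.
    definition: "absolute" (absolute agreement) or "consistency".
    """
    n = len(scores)
    k = len(scores[0]) if n else 0
    if n < 2 or k < 2 or any(len(r) != k for r in scores):
        raise ValueError("need >= 2 participants and >= 2 equal-length columns")

    grand = sum(sum(r) for r in scores) / (n * k)
    row_means = [sum(r) / k for r in scores]
    col_means = [sum(r[j] for r in scores) / n for j in range(k)]

    # Sums of squares for participants (rows), devices (columns), and error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for r in scores for x in r)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_error / ((n - 1) * (k - 1))

    if definition == "consistency":
        denom = msr + (k - 1) * mse
    elif definition == "absolute":
        # The column (device) variance enters only in the agreement form.
        denom = msr + (k - 1) * mse + k * (msc - mse) / n
    else:
        raise ValueError("definition must be 'absolute' or 'consistency'")
    if denom == 0:
        raise ValueError("ICC is undefined when all scores are identical")
    return (msr - mse) / denom


# Hypothetical activity minutes: device B reads 10 higher than device A.
paired = [[10, 20], [20, 30], [30, 40], [40, 50]]
print(round(icc_two_way_single(paired, "consistency"), 3))
print(round(icc_two_way_single(paired, "absolute"), 3))
```

With a constant offset between devices, the consistency form stays at its maximum while the absolute-agreement form drops, which mirrors the reason correlation coefficients alone are rated as "Doubtful".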

VALIDITY
Note that if a study assessed the criterion or convergent validity of a specific data analysis approach (e.g., cut-point-based or multi-parameter methods), the applicable additional checkbox items must be rated in addition to criterion or convergent validity, depending on the comparator instrument.

Criterion validity
An important assumption in criterion validity is that the best reference methods available (i.e., polysomnography for sleep, indirect calorimetry such as doubly labelled water for total energy expenditure) were considered the gold standard to assess validity of the accelerometer-based method. Although these gold standards are not perfect (e.g., doubly labelled water cannot distinguish between type, frequency, and intensity of activities), they are viewed as the best reference measures available. Depending on the type of score, the preferred statistics are the area under the receiver operating characteristic curve (AUC-ROC) (for continuous scores) or sensitivity and specificity (rated using items 1-2). The checklist for assessing criterion validity is presented in Box 2a and explained below.
Items 1, 2. Were correlations or the AUC-ROC calculated? Were sensitivity and specificity determined?
When both the accelerometer-derived data and the gold standard were analyzed as continuous scores, correlations are the preferred statistical method. When the accelerometer-derived data were analyzed as continuous and the gold standard as dichotomous, the AUC-ROC is the preferred statistical method. When both the accelerometer-derived data and the gold standard were analyzed as dichotomous scores, sensitivity and specificity are the preferred statistical methods. Note that if the accelerometer-derived data were analyzed using a multi-parameter method and model scores were reported, item 1 was scored as "NA", as the statistical analyses of the specific data analysis approach were further rated in the applicable additional checkbox.

Item 3. Were there any other important flaws?
The most important flaw for measuring physical behavior is related to the epoch length, as longer epochs are less sensitive to changes in type and intensity of physical activity, as well as to intermittent behaviors. If, without a plausible reason such as alignment with the gold standard, the epoch length was 60 s or longer in a study that examined the validity of an accelerometer-based method for assessment of physical activity, this item was rated as "Doubtful". Another example of a flaw is not calculating a 95% confidence interval for AUC-ROC values.
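
The statistics named for criterion validity can be sketched as follows: sensitivity and specificity when both scores are dichotomous, and the AUC-ROC when the accelerometer output is continuous and the gold standard is dichotomous. The function names, the epoch-level example values, and the threshold of 420 counts are hypothetical and chosen only for illustration; the AUC is computed here with the rank-based (Mann-Whitney) formulation, without a confidence interval.

```python
def sensitivity_specificity(predicted, reference):
    """Sensitivity and specificity of dichotomous predictions (1 = present)."""
    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p and r)
    fn = sum(1 for p, r in pairs if not p and r)
    tn = sum(1 for p, r in pairs if not p and not r)
    fp = sum(1 for p, r in pairs if p and not r)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("reference must contain both classes")
    return tp / (tp + fn), tn / (tn + fp)


def auc_roc(scores, reference):
    """AUC-ROC as the probability that a positive epoch outscores a negative one."""
    pos = [s for s, r in zip(scores, reference) if r]
    neg = [s for s, r in zip(scores, reference) if not r]
    if not pos or not neg:
        raise ValueError("reference must contain both classes")
    # Ties between a positive and a negative epoch count as half a win.
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical epoch-level data: accelerometer counts and a dichotomous
# gold-standard label (1 = behavior present, 0 = absent).
counts = [120, 300, 80, 950, 1100, 400, 1500, 600]
gold = [0, 0, 0, 1, 1, 1, 1, 0]

threshold = 420  # hypothetical cut-point, for illustration only
predicted = [1 if c >= threshold else 0 for c in counts]

sens, spec = sensitivity_specificity(predicted, gold)
print("sensitivity:", sens, "specificity:", spec)
print("AUC-ROC:", auc_roc(counts, gold))
```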

Convergent validity
If a comparator instrument other than a "gold standard" was used to evaluate validity, convergent validity was evaluated. For measuring convergent validity of accelerometer-based methods, typical comparator instruments are direct observation or other accelerometer-based methods (including different device types or analysis approaches). An important assumption of convergent validity is that the comparator instrument has sufficient measurement properties (rated using items 1-2). Additionally, it is important that studies not only addressed the agreement between the two methods (e.g., correlation) but also evaluated the disagreement between them (rated using item 3). The checklist for assessing convergent validity is presented in Box 2b and explained below.

The comparator instruments are required to be explained in detail. For example, if the observation scheme was provided and explained, it is recommended to rate item 1 as "Very good". If the observation scheme was not presented or referred to, this item is scored as "Inadequate". In addition, the measurement properties of the comparator instrument need to be sufficient (e.g., κ > .70), preferably tested in a population similar to the study population.
When the comparator instrument was an accelerometer-based device that used orientation classification (thigh data), e.g., activPAL, to measure moderate physical activity, vigorous physical activity, or moderate-to-vigorous physical activity, it is recommended to rate item 2 as "Doubtful", because such a device registers posture. If such a device was used to assess sedentary behavior, item 2 can be rated as "Very good". If an observation scheme was used and interrater agreement was sufficient (i.e., κ > .70), this item can be rated as "Very good".
Item 3. Was the statistical method appropriate for the hypotheses to be tested?
As the comparator instrument is not considered a "gold standard", the study is required to address not only the agreement between the accelerometer-based method and the comparator instrument (e.g., correlation) but also the disagreement between the two methods (i.e., preferably Bland-Altman plots with LoA) [7]. Note that if additional scoring for the analysis approach is required (e.g., convergent validity of a cut-point-based or multi-parameter method is evaluated), this item can be scored as "NA", because the statistical approach is rated in the applicable additional checkbox.
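
The disagreement analysis referred to above can be sketched numerically as the mean difference (bias) and the 95% limits of agreement that a Bland-Altman plot displays. The function name and the paired example values are hypothetical; the sketch assumes approximately normally distributed differences and uses the conventional bias ± 1.96 SD limits.

```python
import math
import statistics


def limits_of_agreement(method_scores, comparator_scores):
    """Return (bias, lower LoA, upper LoA) for paired measurements."""
    if len(method_scores) != len(comparator_scores) or len(method_scores) < 2:
        raise ValueError("need at least two paired measurements")
    diffs = [m - c for m, c in zip(method_scores, comparator_scores)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical MVPA minutes from an accelerometer-based method and from
# a comparator instrument (e.g., direct observation) for five children.
accelerometer = [30, 42, 55, 61, 48]
comparator = [28, 45, 50, 60, 47]

bias, lower, upper = limits_of_agreement(accelerometer, comparator)
print(f"bias = {bias:.2f}, LoA = [{lower:.2f}, {upper:.2f}]")
```

Two methods can correlate strongly and still show a non-zero bias or wide limits, which is why agreement and disagreement are both rated.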